CAIDA Data Analysis by Jeffrey Hsieh

The Center for Applied Internet Data Analysis (CAIDA) gathers Internet data and provides it to the scientific research community. In this project, one publicly available dataset, Trace Statistics for CAIDA Passive OC48 and OC192 Traces, was gathered and studied.

Univariate Plots Section

The data is a space-separated text file with some comment lines. After it was read into a data frame, it was found to have the following columns.

##  [1] "SIZE"     "X.IPv4."  "SCTP"     "IPv6"     "ESP"      "UDP"     
##  [7] "GRE"      "ICMP"     "TCP"      "UNKNOWN"  "X.IPv6."  "ICMP6"   
## [13] "UDP.1"    "TCP.1"    "X.IPv6t." "ICMP6.1"  "UDP.2"    "TCP.2"

The dataset consists of 18 variables.
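The dotted and numbered column names are consistent with how R cleans up headers on import. Here is a minimal sketch, assuming the file's header uses bracketed group labels such as [IPv4] and repeats protocol names; the inline excerpt is hypothetical and only stands in for the real file.

```r
# Hypothetical excerpt, not the real trace-statistics file.
raw <- "
# comment lines are skipped because comment.char defaults to the hash sign
SIZE [IPv4] UDP TCP [IPv6] UDP TCP
988 149025 6963 142062 1452 0 1452
645 67079 4239 62840 2509 16 2493
"

# header = TRUE reads the first non-comment line as column names.
# check.names = TRUE (the default) turns invalid characters into dots,
# prefixes an X where needed, and numbers duplicates (UDP, UDP.1, ...).
traces <- read.table(text = raw, header = TRUE)
names(traces)
```

Under that assumption, the bracketed labels become X.IPv4. and X.IPv6., and the repeated UDP and TCP headers pick up the .1 suffix, which matches the naming pattern in the column list above.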

Let’s look at five samples of the data.

##      SIZE X.IPv4. SCTP IPv6 ESP   UDP GRE ICMP    TCP UNKNOWN X.IPv6.
## 968   988  149025    0    0   0  6963   0    0 142062       0    1452
## 1117 1137   66396    0    0   0 26690   0    0  39706       0    3180
## 625   645   67079    0    0   0  4239   0    0  62840       0    2509
## 408   428  178456    0    0  90  9961  17  121 168267       0    1303
## 949   969  109767    0    0   0  8116   0    0 101651       0    1608
##      ICMP6 UDP.1 TCP.1 X.IPv6t. ICMP6.1 UDP.2 TCP.2
## 968      0     0  1452        0       0     0     0
## 1117     0     0  3180        0       0     0     0
## 625      0    16  2493        0       0     0     0
## 408      0    11  1292        0       0     0     0
## 949      0     1  1607        0       0     0     0

Semantically, SIZE is different from the other variables. It refers to the size of a packet, while each of the other variables holds the packet count for one type of packet.

SIZE is nearly uniform: each value from close to 0 up to 1500 appears exactly once. This implies that packets of almost every size in that range were observed. Let’s look at the 10 smallest and 10 largest sizes.

##  [1] 21 22 23 24 25 26 27 28 29 30
##  [1] 1498 1499 1500 1504 1632 1668 1740 1844 1848 4003

The minimum SIZE is 21. The uniform pattern does not hold beyond 1500: larger sizes appear only sparsely, with an extreme outlier at 4003.
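The SIZE column can be reconstructed from the two outputs above to double-check this reading. This is a sketch, assuming every integer from 21 to 1500 is present exactly once, followed by the seven larger sizes listed; the vector name is illustrative.

```r
# Assumption: sizes 21..1500 each occur once, plus the seven sparse sizes
# shown at the end of the "largest 10" output.
size <- c(21:1500, 1504, 1632, 1668, 1740, 1844, 1848, 4003)

head(sort(size), 10)   # 10 smallest sizes
tail(sort(size), 10)   # 10 largest sizes

# Values to compare against the summary() and column-sum output in this report.
length(size)
sum(size)
median(size)
```

If the assumption holds, the sum and median computed here should agree with the SIZE figures reported in the summary and column totals later in the report.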

Now check the distribution of all the other “type” variables.

On a linear scale, the counts are heavily concentrated at low values, with a few possible outliers at large values. Let’s look at a log scale.
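To see why the log scale helps, the five-number summary of X.IPv4. reported in this report's summary() output can be compared on both scales. This is only a small illustration; the vector name is made up for the example.

```r
# Min, quartiles, and max of X.IPv4. as printed by summary() in this report.
ipv4_counts <- c(min = 1, q1 = 55732, median = 88785,
                 q3 = 210714, max = 446257816)

# Linear scale: the maximum is thousands of times the median,
# so everything else is squeezed against the axis.
ipv4_counts[["max"]] / ipv4_counts[["median"]]

# Log10 scale: the same values spread over a handful of units.
log10(ipv4_counts)
```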

In terms of amount of data, TCP, TCP.1, UDP, UDP.1, X.IPv4., and X.IPv6. have far more data than the other types, which suggests looking at these variables first in the following analysis.

Univariate Analysis

Bivariate Plots Section

SIZE is quite different from all the other variables: semantically it is the packet size, while the other variables are about packet types. The relationships to explore are therefore most likely those between SIZE and each of the other 17 variables.

Let’s plot the first two variables to see what they look like.

For the first pair of variables, a small number of outliers on a linear scale makes the main structure of the data difficult to examine.

Also, the data points are densely concentrated, so transparency should be used.

With a log10 scale on the packet counts “X.IPv4.” and an alpha of 1/5 applied, the main structure of the dataset is much easier to comprehend.
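The report does not show its plotting code, so here is a base-R sketch of the same idea using only the five sample rows printed earlier. With so few points the transparency has no visible benefit; the call just shows where the log axis and the alpha go.

```r
# Five sample rows shown earlier in this report (SIZE and X.IPv4. only).
traces <- data.frame(
  SIZE    = c(988, 1137, 645, 428, 969),
  X.IPv4. = c(149025, 66396, 67079, 178456, 109767)
)

# log = "y" draws the counts on a log10 axis;
# adjustcolor() makes the points 1/5 opaque so dense areas show up darker.
plot(traces$SIZE, traces$X.IPv4., log = "y", pch = 16,
     col = adjustcolor("black", alpha.f = 1 / 5),
     xlab = "SIZE", ylab = "X.IPv4. (log10 scale)")
```

In ggplot2 the equivalent would typically be a geom_point() layer with alpha = 1/5 combined with scale_y_log10().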

Packet size also has outliers beyond 1500, and even beyond 2000.

Eliminating the data where SIZE is more than 2000 zooms in on the distribution. It now looks like a slight negative correlation.

A linear regression line gives a better view of that negative correlation.
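The filter and the regression can be sketched as follows. Only the five sample rows printed earlier are used, so the fitted slope shows the mechanics and says little about the full dataset; fitting on log10 counts is an assumption that mirrors the log10 axis used in the plots.

```r
# Five sample rows shown earlier in this report (SIZE and X.IPv4. only).
traces <- data.frame(
  SIZE    = c(988, 1137, 645, 428, 969),
  X.IPv4. = c(149025, 66396, 67079, 178456, 109767)
)

# Drop the sparse large sizes, as described above.
kept <- subset(traces, SIZE <= 2000)

# Assumption: regress log10(count) on SIZE to match the log10 count axis.
fit <- lm(log10(X.IPv4.) ~ SIZE, data = kept)

# A negative slope means counts fall as packets get larger.
coef(fit)[["SIZE"]]
```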

Now we choose another variable, X.IPv6.

“X.IPv6.” shows a similar slight negative correlation with SIZE.

The same negative correlation holds for IPv6t (tunneled IPv6) packets.

Previously we eliminated the data where SIZE is larger than 2000. Is there a way to get an equally good visualization without cutting data?

Instead of cutting the outliers where SIZE is more than 2000, using a log10 scale on SIZE also gives a good visualization. This should be the preferred method, because we do not yet know whether cutting data would remove important information.

Let’s apply the same log10 scale, instead of cutting, to the other two examples examined above.

Good. IPv6 can be visualized well using log10 instead of cutting.

The same holds for IPv6t.

Now let’s look at all variables to determine how we want to examine them.

##       SIZE           X.IPv4.               SCTP          
##  Min.   :  21.0   Min.   :        1   Min.   :0.0000000  
##  1st Qu.: 392.5   1st Qu.:    55732   1st Qu.:0.0000000  
##  Median : 764.0   Median :    88785   Median :0.0000000  
##  Mean   : 766.5   Mean   :  1180318   Mean   :0.0006725  
##  3rd Qu.:1135.5   3rd Qu.:   210714   3rd Qu.:0.0000000  
##  Max.   :4003.0   Max.   :446257816   Max.   :1.0000000  
##       IPv6              ESP              UDP                GRE          
##  Min.   :  0.000   Min.   :     0   Min.   :       0   Min.   :     0.0  
##  1st Qu.:  0.000   1st Qu.:     0   1st Qu.:    7558   1st Qu.:     0.0  
##  Median :  0.000   Median :     0   Median :   13430   Median :     1.0  
##  Mean   :  1.116   Mean   :  1902   Mean   :  130137   Mean   :   164.3  
##  3rd Qu.:  0.000   3rd Qu.:     0   3rd Qu.:   31028   3rd Qu.:     4.0  
##  Max.   :505.000   Max.   :442389   Max.   :71018929   Max.   :203249.0  
##       ICMP               TCP               UNKNOWN        
##  Min.   :     0.0   Min.   :        0   Min.   :  0.0000  
##  1st Qu.:     0.0   1st Qu.:    42163   1st Qu.:  0.0000  
##  Median :     0.0   Median :    71155   Median :  0.0000  
##  Mean   :  1653.6   Mean   :  1046460   Mean   :  0.6765  
##  3rd Qu.:    23.5   3rd Qu.:   149548   3rd Qu.:  0.0000  
##  Max.   :272798.0   Max.   :446027466   Max.   :360.0000  
##     X.IPv6.             ICMP6              UDP.1         
##  Min.   :       0   Min.   :     0.0   Min.   :     0.0  
##  1st Qu.:    1226   1st Qu.:     0.0   1st Qu.:     0.0  
##  Median :    1853   Median :     0.0   Median :     3.0  
##  Mean   :   37672   Mean   :   142.7   Mean   :   571.5  
##  3rd Qu.:    3633   3rd Qu.:     0.0   3rd Qu.:    18.0  
##  Max.   :22721559   Max.   :109201.0   Max.   :178140.0  
##      TCP.1             X.IPv6t.          ICMP6.1        
##  Min.   :       0   Min.   :  0.000   Min.   :  0.0000  
##  1st Qu.:    1200   1st Qu.:  0.000   1st Qu.:  0.0000  
##  Median :    1762   Median :  0.000   Median :  0.0000  
##  Mean   :   36958   Mean   :  1.024   Mean   :  0.4217  
##  3rd Qu.:    3446   3rd Qu.:  0.000   3rd Qu.:  0.0000  
##  Max.   :22721559   Max.   :505.000   Max.   :505.0000  
##      UDP.2              TCP.2         
##  Min.   :  0.0000   Min.   :  0.0000  
##  1st Qu.:  0.0000   1st Qu.:  0.0000  
##  Median :  0.0000   Median :  0.0000  
##  Mean   :  0.1231   Mean   :  0.4795  
##  3rd Qu.:  0.0000   3rd Qu.:  0.0000  
##  Max.   :115.0000   Max.   :285.0000

Certain types of packets dominate, while some types have very small counts.

##       SIZE    X.IPv4.       SCTP       IPv6        ESP        UDP 
##    1139779 1755133537          1       1660    2828143  193513127 
##        GRE       ICMP        TCP    UNKNOWN    X.IPv6.      ICMP6 
##     244369    2458952 1556086279       1006   56018871     212226 
##      UDP.1      TCP.1   X.IPv6t.    ICMP6.1      UDP.2      TCP.2 
##     849794   54956851       1523        627        183        713

Summing the packet counts across all SIZE values confirms the dominance of certain protocols. For example, X.IPv4., UDP, TCP, X.IPv6., UDP.1, and TCP.1 have far larger totals than the others.
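Column totals like these can be produced with colSums(). The sketch below runs it on four columns of the five sample rows printed earlier, so the totals are those of the sample, not of the full dataset.

```r
# Four columns from the five sample rows shown earlier in this report.
traces <- data.frame(
  X.IPv4. = c(149025, 66396, 67079, 178456, 109767),
  UDP     = c(6963, 26690, 4239, 9961, 8116),
  TCP     = c(142062, 39706, 62840, 168267, 101651),
  ICMP    = c(0, 0, 0, 121, 0)
)

# Sum each column over all rows (that is, over all packet sizes).
totals <- colSums(traces)

# Largest totals first, to see which protocols dominate.
sort(totals, decreasing = TRUE)
```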

Bivariate Analysis

The data consists of 18 variables. The first variable, “SIZE”, is the packet size and is used as the index. Observing the first count variable, “X.IPv4.”, revealed two characteristics: a few outliers on a linear scale obscure the main structure, which a log10 scale resolves, and the points are densely concentrated, which transparency resolves.

With the above modifications and a linear regression added, there appears to be a negative correlation between packet size and packet count for IPv4.

Doing the same visualization on IPv6 (“X.IPv6.”) and IPv6 Tunnel (“X.IPv6t.”) shows a similar negative correlation. In packet counts, IPv4 leads, IPv6 is second, and IPv6 Tunnel has only a very small amount. This is in line with the landscape of the Internet: IPv4 is the primary protocol in use, while IPv6 is the newer protocol and the transition is still under way. IPv6 tunneling is a temporary technology used only during the transition period.

Further observation found that although the data supplies 18 variables, it is not tidy. Each column represents a type of packet. Melting the types into a single variable makes analysis across types easier. Further, the types of packets fall under three groups: IPv4, IPv6, and tunneled IPv6 (IPv6t).

ICMP, UDP, and TCP are the most important upper-layer protocols for all types of IP packets. Other protocols such as SCTP, ESP, and GRE only have data in the IPv4 group and are less important, as evidenced by their total packet counts compared to those of ICMP, UDP, and TCP. Focusing the comparison on ICMP, UDP, and TCP, as well as the total within each group, should provide better insight.

Bivariate Plots Section (continued)

The data was wrangled to add two new categorical variables with a hierarchical relationship. TYPE is the inner variable and takes the value “TOTAL”, “TCP”, “UDP”, or “ICMP”. GROUP is the outer variable and takes the value “IPv4”, “IPv6”, or “IPv6t”. Finally, PACKET_COUNTS contains the original values. SIZE is retained.

## [1] "SIZE"          "TYPE"          "PACKET_COUNTS" "GROUP"

After the data wrangling, the number of variables is reduced to four.

##       SIZE  TYPE PACKET_COUNTS GROUP
## 12993 1117 TOTAL             0 IPv6t
## 7481    66   TCP           141  IPv6
## 17131  794  ICMP             0 IPv6t
## 1201  1221 TOTAL         47411  IPv4
## 16776  439  ICMP             0 IPv6t

As the sample above shows, SIZE and PACKET_COUNTS retain the numeric data. The 12 packet-type columns that were kept (three groups times four types) are now identified by the combination of two hierarchical categorical variables: GROUP and TYPE.
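The wrangling code is not shown in this report, so here is a base-R sketch of one way to get the same four columns. The column-to-GROUP/TYPE mapping in `key` is an assumption inferred from the matching sums in the tables of this report, and only three of the sample rows are used.

```r
# Three sample rows from this report, restricted to the 12 columns kept.
wide <- data.frame(
  SIZE     = c(988, 1137, 645),
  X.IPv4.  = c(149025, 66396, 67079),
  TCP      = c(142062, 39706, 62840),
  UDP      = c(6963, 26690, 4239),
  ICMP     = c(0, 0, 0),
  X.IPv6.  = c(1452, 3180, 2509),
  TCP.1    = c(1452, 3180, 2493),
  UDP.1    = c(0, 0, 16),
  ICMP6    = c(0, 0, 0),
  X.IPv6t. = c(0, 0, 0),
  TCP.2    = c(0, 0, 0),
  UDP.2    = c(0, 0, 0),
  ICMP6.1  = c(0, 0, 0)
)

# Assumed mapping from each wide column to its GROUP and TYPE.
key <- data.frame(
  column = c("X.IPv4.", "TCP", "UDP", "ICMP",
             "X.IPv6.", "TCP.1", "UDP.1", "ICMP6",
             "X.IPv6t.", "TCP.2", "UDP.2", "ICMP6.1"),
  GROUP  = rep(c("IPv4", "IPv6", "IPv6t"), each = 4),
  TYPE   = rep(c("TOTAL", "TCP", "UDP", "ICMP"), times = 3),
  stringsAsFactors = FALSE
)

# Stack one block of rows per wide column, tagging each block
# with the GROUP and TYPE it belongs to.
long <- do.call(rbind, lapply(seq_len(nrow(key)), function(i) {
  data.frame(
    SIZE          = wide$SIZE,
    TYPE          = key$TYPE[i],
    PACKET_COUNTS = wide[[key$column[i]]],
    GROUP         = key$GROUP[i],
    stringsAsFactors = FALSE
  )
}))

names(long)
nrow(long)
```

A melt-style function from a reshaping package would do the stacking in one call; the explicit loop is used here only to keep the sketch free of package dependencies.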

Visualization by Group of IPv4, IPv6, and IPv6t

The negative correlation between size and packet counts is consistent with the single-variable analysis. All types of packets have a similar negative correlation, although to differing degrees.

## # A tibble: 4 x 4
##   TYPE      Mean Median        Sum
##   <fct>    <dbl>  <int>      <int>
## 1 TOTAL 1180318.  88785 1755133537
## 2 TCP   1046460.  71155 1556086279
## 3 UDP    130137.  13430  193513127
## 4 ICMP     1654.      0    2458952

The majority are TCP packets, with a fraction of UDP packets and far fewer ICMP packets.
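A per-type table of mean, median, and sum can be sketched in base R as below. The tibble output above suggests the report used a different tool, so this is a stand-in, and the data is just the IPv4 slice of three sample rows from this report.

```r
# IPv4 counts for three sample rows (SIZE 988, 1137, 645) in long form.
ipv4 <- data.frame(
  TYPE = rep(c("TOTAL", "TCP", "UDP"), each = 3),
  PACKET_COUNTS = c(149025, 66396, 67079,
                    142062, 39706, 62840,
                    6963, 26690, 4239),
  stringsAsFactors = FALSE
)

# Split the counts by TYPE, summarize each piece, and
# transpose so that each TYPE becomes one row.
stats <- t(sapply(split(ipv4$PACKET_COUNTS, ipv4$TYPE), function(x) {
  c(Mean = mean(x), Median = median(x), Sum = sum(x))
}))
stats
```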

IPv6 UDP appears to have a different relationship than the other types.

## # A tibble: 4 x 4
##   TYPE    Mean Median      Sum
##   <fct>  <dbl>  <int>    <int>
## 1 TOTAL 37672.   1853 56018871
## 2 TCP   36958.   1762 54956851
## 3 UDP     571.      3   849794
## 4 ICMP    143.      0   212226

The majority are TCP packets, with a fraction of UDP packets and far fewer ICMP packets.

Although its packet counts are far lower, IPv6t appears to have a structure similar to IPv4.

## # A tibble: 4 x 4
##   TYPE   Mean Median   Sum
##   <fct> <dbl>  <int> <int>
## 1 TOTAL 1.02       0  1523
## 2 TCP   0.479      0   713
## 3 UDP   0.123      0   183
## 4 ICMP  0.422      0   627

The statistics show that IPv6t has an insignificant number of packets compared to IPv4 and IPv6.

Visualization by Type of packets: TOTAL, TCP, UDP, ICMP

The total packet counts appear to have a similar structure of slight negative correlation.

## # A tibble: 3 x 4
##   GROUP       Mean Median        Sum
##   <fct>      <dbl>  <int>      <int>
## 1 IPv4  1180318.    88785 1755133537
## 2 IPv6    37672.     1853   56018871
## 3 IPv6t       1.02      0       1523

The statistics show that IPv4 far exceeds IPv6, with IPv6t negligible.

TCP has the same slight negative correlation.

## # A tibble: 3 x 4
##   GROUP        Mean Median        Sum
##   <fct>       <dbl>  <int>      <int>
## 1 IPv4  1046460.     71155 1556086279
## 2 IPv6    36958.      1762   54956851
## 3 IPv6t       0.479      0        713

The same ordering, IPv4 >> IPv6 >> IPv6t, holds in terms of counts.

Both IPv4 UDP and IPv6 UDP are quite different from the others. More on this in the analysis section.

## # A tibble: 3 x 4
##   GROUP       Mean Median       Sum
##   <fct>      <dbl>  <int>     <int>
## 1 IPv4  130137.     13430 193513127
## 2 IPv6     571.         3    849794
## 3 IPv6t      0.123      0       183

## # A tibble: 3 x 4
##   GROUP     Mean Median     Sum
##   <fct>    <dbl>  <int>   <int>
## 1 IPv4  1654.         0 2458952
## 2 IPv6   143.         0  212226
## 3 IPv6t    0.422      0     627

Bivariate Analysis (continued)

Analysis by group of IPv4, IPv6, and IPv6t

  • IPv4
    • On a log scale, the packet counts show wide variation between TCP, UDP, and ICMP. The sums of these three types confirm that they differ by orders of magnitude. TCP is by far the leading type of packet, followed by UDP, with ICMP a distant third.
    • All types of packets show a negative correlation between packet counts and size, but to varying degrees. ICMP has the strongest negative correlation: most ICMP packets are small. ICMP is a control protocol and normally does not carry much payload, so this makes sense. TCP is the main protocol today, and applications of all kinds, with different characteristics and packet sizes, ride on it, so its negative correlation is less pronounced. UDP is interesting in that it has a stronger negative correlation than TCP, but its packet counts also slope upward when the size is close to and beyond 1000. This could be explained by two major types of UDP applications. The first is DNS, which translates a domain name like www.udacity.com into an IP address. There is a lot of DNS traffic, but its packets tend to be small. Another major application of UDP is live video streaming, which usually uses larger packets. The upward slope might be explained by live video streaming.
    • Apart from the additional packet types that we chose to leave out because of their small counts, the negative correlation of TOTAL makes sense: it combines the nearly flat TCP with the steeper downward slopes of UDP and ICMP, so the slope of TOTAL is steeper than that of TCP.
    • Some of the medians are interesting. For example, the median for ICMP is zero. This is because the packet counts are indexed by packet size. ICMP has far fewer kinds of packets, so they tend to share a few packet sizes, and data indexed this way contains many zeros when every packet size is listed.
  • IPv6
    • The dominance of TCP appears to be more pronounced in IPv6. TCP carries roughly 50 times the UDP and ICMP traffic combined.
    • TCP still accounts for by far the most packets, with UDP second and ICMP a distant third.
    • The negative correlations are similar to IPv4. For UDP, however, the downward slope is far steeper. This might be because the live video streaming applications mentioned above use only IPv4 and not IPv6. As a result, there are not many large UDP packets in IPv6.
    • Both UDP and ICMP have very small or zero medians. Again, this could be explained by few types of applications using these protocols, so the variety of packet sizes tends to be limited.
  • IPv6t
    • The amount of IPv6 tunnel traffic is just too small to show meaningful characteristics. Note that all medians are zero. The absolute counts are in the hundreds per type, dwarfed by the billion scale of IPv4 and the tens-of-millions scale of IPv6.
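The TCP dominance described above can be checked directly from the Sum columns of the two summary tables. This is a small worked calculation; the matrix layout and variable names are just for illustration.

```r
# Sum column of the IPv4 and IPv6 summary tables in this report.
sums <- rbind(
  IPv4 = c(TOTAL = 1755133537, TCP = 1556086279, UDP = 193513127, ICMP = 2458952),
  IPv6 = c(TOTAL = 56018871,   TCP = 54956851,   UDP = 849794,    ICMP = 212226)
)

# Share of all packets in each group that are TCP.
tcp_share <- sums[, "TCP"] / sums[, "TOTAL"]

# TCP relative to UDP and ICMP combined.
tcp_vs_rest <- sums[, "TCP"] / (sums[, "UDP"] + sums[, "ICMP"])

round(tcp_share, 3)
round(tcp_vs_rest, 1)
```

The second ratio is the one behind the "roughly 50 times" statement for IPv6; the same ratio for IPv4 is much smaller.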

Analysis by type of TOTAL, TCP, UDP, and ICMP

  • TOTAL
    • The dominance of IPv4 is obvious when graphed by group. IPv6 is second, and IPv6t is a very distant third.
    • The negative correlation within each group appears to be similar, with regression lines of about the same slope.
  • TCP
    • The dominance of IPv4 is obvious, followed by IPv6, with IPv6t a very distant third.
    • The negative correlation within each group appears to be similar, with regression lines of about the same slope.
  • UDP
    • The dominance of IPv4 is obvious, followed by IPv6, with IPv6t a very distant third.
    • The negative correlations within the IPv4 and IPv6t groups appear to be similar.
    • IPv6 UDP has a very steep regression line. The lack of large UDP packets could be explained by the lack of live video streaming over IPv6, as discussed in the previous section.
  • ICMP
    • The dominance of IPv4 is obvious, followed by IPv6, with IPv6t a very distant third.
    • The negative correlation within each group appears to be similar, with regression lines of about the same slope.

Multivariate Plots Section

This section attempts to combine all variables to discover the relationships between them.

Plotting points for one group (IPv4) and a line for another group (IPv6) shows the difference in scale of packet counts between them.

Use a point plot throughout, colored by Type and shaped by Group.

Reversed the color and shape mapping, so the plot is colored by Group and shaped by Type. This has better visibility because of the difference in scale between groups.

Given that the TOTAL type is redundant information (it is just the sum of all types combined), removing it could make the plot less crowded. This plot is colored by Type and shaped by Group.

The same plot with the TOTAL type removed, then colored by Group and shaped by Type.

##       SIZE  TYPE PACKET_COUNTS GROUP  GROUP_TYPE
## 12952 1076 TOTAL             0 IPv6t IPv6t TOTAL
## 16420   83  ICMP             0 IPv6t  IPv6t ICMP
## 388    408 TOTAL        123047  IPv4  IPv4 TOTAL
## 5852  1411  ICMP             0  IPv4   IPv4 ICMP
## 1946   479   TCP        193139  IPv4    IPv4 TCP

Concatenated the two categorical variables Group and Type into one.
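A concatenated column like GROUP_TYPE can be built with paste(). This sketch uses the GROUP and TYPE values from the five sample rows shown above.

```r
# GROUP and TYPE values from the five sample rows shown above.
long <- data.frame(
  GROUP = c("IPv6t", "IPv6t", "IPv4", "IPv4", "IPv4"),
  TYPE  = c("TOTAL", "ICMP", "TOTAL", "ICMP", "TCP"),
  stringsAsFactors = FALSE
)

# paste() joins the two labels with a single space by default.
long$GROUP_TYPE <- paste(long$GROUP, long$TYPE)
long
```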

The plot with concatenated group and type variables.

Multivariate Analysis

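Three methods were attempted to plot two categorical variables on top of the x and y scales of two numeric variables: drawing one group as points and another as a line, mapping the two variables to color and shape, and concatenating them into a single variable.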

Final Plots and Summary

Of all the graphing and analysis done, three major findings are summarized in different plots.

Plot One

Description One

By tuning the colors to be unique for each type while using a color family of orange for IPv4, green for IPv6, and blue for IPv6t, the characteristics of each type of packet in each group are visually more pronounced in a graph with log10 scales for both packet size and packet counts.
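One way to set up such a palette is a named vector with one shade per group and type pair. The specific shades below are hypothetical (built-in R color names), since the report does not list the colors it used.

```r
groups <- c("IPv4", "IPv6", "IPv6t")
types  <- c("TOTAL", "TCP", "UDP", "ICMP")

# One label per pair, group by group: "IPv4 TOTAL", "IPv4 TCP", ...
labels <- paste(rep(groups, each = length(types)),
                rep(types, times = length(groups)))

# Hypothetical shades: four oranges, four greens, four blues.
shades <- c(paste0("orange", 1:4), paste0("green", 1:4), paste0("blue", 1:4))

palette_by_group_type <- setNames(shades, labels)
palette_by_group_type[c("IPv4 TCP", "IPv6 UDP", "IPv6t ICMP")]

# With ggplot2, a vector like this could be passed to
# scale_color_manual(values = palette_by_group_type).
```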

It can be seen that IPv4 has more packets than IPv6, and IPv6 has more than IPv6t. This is on a log scale, so the difference is even bigger than the visualization suggests.

Also, each type of packet tends to have lower packet counts when the size is larger, although the reduction in packet counts differs for each type.

Plot Two

Description Two

The two main findings can be more easily visualized by showing the combined packet counts for each group: IPv4, IPv6, and IPv6t.

  • First, the difference between packet counts for each group is huge. Note that the graph is on a log10 scale.
  • Second, all of the groups show a negative correlation between packet counts and packet size. Intuitively this makes sense: assuming the amount of data to be transmitted is the same, the larger the packet size, the fewer packets are needed to complete the transfer.
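The intuition in the second point can be put in numbers. This is a toy calculation with a hypothetical transfer size and hypothetical packet sizes, ignoring headers and retransmissions.

```r
data_bytes  <- 1e6                       # hypothetical amount of data to send
packet_size <- c(100, 500, 1000, 1500)   # hypothetical payload sizes in bytes

# Number of packets needed to carry the same data at each size.
packets_needed <- ceiling(data_bytes / packet_size)

data.frame(packet_size, packets_needed)
```

The packet count falls as the size grows, which is the same direction as the negative correlation seen in the plots.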

Plot Three

Description Three

The negative correlation between packet counts and packet sizes does have some exceptions. Specifically, the UDP packets for IPv4 and IPv6 both behave somewhat differently. For IPv4, the negative correlation breaks down a little when the packet size is close to and beyond 1000, implying that a special type of application using UDP with large packets is present in IPv4. Interestingly, this type of packet might be absent from IPv6. Without applications that send large UDP packets, the number of large UDP packets drops even more sharply in IPv6.

UDP has a major application in DNS, which uses small packets. It also has a major application in live video streaming, which tends to use large packets for transmission efficiency. This observation suggests that live video streaming may be present in IPv4 but not in IPv6. Additional data not available in this study will be needed to confirm this suspicion.

Reflection

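Exploratory Data Analysis is fun in that you never know what you will find until the end of the process, or whether the data is useful or sufficient to provide insights. Through the process, we found the following: IPv4 packet counts far exceed those of IPv6, which in turn far exceed those of IPv6t; packet counts generally fall as packet size grows; and UDP departs from that trend, differently in IPv4 and IPv6.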

Reference

This study references various materials publicly available on the Internet.